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Abstract — We propose a principled algorithm for robust 
Bayeslan filtering and smoothing In nonlinear stochastic dynamic 
systems when both the transition function and the measurement 
function are described by non-parametric Gaussian process 
(GP) models. GPs are gaining Increasing Importance in signal 
processing, machine learning, robotics, and control for rep- 
resenting unknown system functions by posterior probability 
distributions. This modern way of "system Identification" Is more 
robust than finding point estimates of a parametric function 
representation. In this article, we present a principled algorithm 
for robust analytic smoothing In GP dynamic systems, which 
are Increasingly used in robotics and control. Our numerical 
evaluations demonstrate the robustness of the proposed approach 
in situations where other state-of-the-art Gaussian filters and 
smoothers can fall. 

Index Terms — Nonlinear systems, Bayeslan Inference, Smooth- 
ing, Gaussian processes, Machine learning 



I. Introduction 

Filtering and smoothing in context of dynamic systems 
refers to a Bayesian methodology for computing posterior 
distributions of the latent state based on a history of noisy 
measurements. This kind of methodology can be found, e.g., in 
navigation, control engineering, robotics, and machine learn- 
ing [l]-[4]. Solutions to filtering [l]-[5] and smoothing [6]- 
[9] in linear dynamic systems are well known, and numerous 
approximations for nonlinear systems have been proposed, for 
both filtering [10]-[15] and smoothing [16]-[19]. 

In this article, we focus on Gaussian filtering and smoothing 
in Gaussian process (GP) dynamic systems. GPs are a robust 
non-parametric method for approximating unknown functions 
by a posterior distribution over them [20], [21]. Although GPs 
have been around for decades, they only recently became com- 
putationally interesting for applications in robotics, control, 
and machine learning [22]-[26]. 

The contribution of this article is the derivation of a novel, 
principled, and robust Rauch-Tung-Striebel (RTS) smoother 
for GP dynamic systems, which we call the GP-RTSS. The GP- 
RTSS computes a Gaussian approximation to the smoothing 
distribution in closed form. The posterior filtering and smooth- 
ing distributions can be computed without linearization [10] or 
(small) sampling approximations of densities [11]. 

We provide numerical evidence that the GP-RTSS is more 
robust than state-of-the-art nonlinear Gaussian filtering and 
smoothing algorithms including the extended Kalman filter 
(EKF) [10], the unscented Kalman filter (UKF) [11], the 
cubature Kalman filter (CKF) [15], the GP-UKF [12], and 
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their corresponding RTS smoothers. Robustness refers to the 
ability of an inferred distribution to explain the "true" state/ 
measurement. 

The paper is structured as follows: In Sees. I-A-I-B, we 
introduce the problem setup and necessary background on 
Gaussian smoothing and GP dynamic systems. In Sec. II, 
we briefly introduce Gaussian process regression, discuss the 
expressiveness of a GP, and explain how to train GPs. Sec. Ill 
details our proposed method (GP-RTSS) for smoothing in 
GP dynamic systems. In Sec. IV, we provide experimental 
evidence of the robustness of the GP-RTSS. Sec. V concludes 
the paper with a discussion. 

A. Problem Formulation and Notation 

In this article, we consider discrete-time stochastic systems 



Wt . 



(1) 

vt , (2) 

e R^ is the measurement 



x* = /(xt-i) 
Z4 == .g(xt) + 

where X( e R^ is the state, Z( 
at time step t, Wj ^ M{0,Y1^) is Gaussian system noise, 
vt ^ 7V(0,St,) is Gaussian measurement noise, / is the 
transition function (or system function) and g is the measure- 
ment function. The discrete time steps t run from to T. 
The initial state xq of the time series is distributed according 
to a Gaussian prior distribution p(xo) = Af {fiQ , 1]^) . The 
purpose of filtering and smoothing is to find approximations 
to the posterior distributions p(xt|zi:,-), where 1 : t in a 
subindex abbreviates 1,...,t with t = t during filtering 
and T = T during smoothing. In this article, we consider 
Gaussian approximations p{xt\zi.r) ~ A/'(xt | /x5^, S5^) of 
the latent state posterior distributions p(xj|zi.T-). We use the 
short-hand notation a^^ where a. — fi denotes the mean /x and 
a. — T, denotes the covariance, b denotes the time step under 
consideration, c denotes the time step up to which we consider 
measurements, and d G {x, z} denotes either the latent space 
(x) or the observed space (z). 

B. Gaussian RTS Smoothing 

Given the filtering distributions p{-x.t\'Zi-t) = 



AA(x,|/x-,S-), t 



1, . . . , T, a sufficient condition 



for Gaussian smoothing is the computation of Gaussian 
approximations of the joint distributions p(xj_i,xt|zi.j_i), 
< = 1,...,T[19]. 

In Gaussian smoothers, the standard smoothing distribution 
for the dynamic system in Eqs. (l)-(2) is always 

p(xt_i|zi:T) =A/'(xt_i|/x^_i|y,S^_i|y), where (3) 

t^t-i\T == (J-t-ilt-i + Jt-i(/^t|T - Mt|t) I (4) 

'^t-l\T ~ ^t-i|t-i + Jt-i(^t|T ~ ^t|t)Jt-i J (5) 

3t-i ■='St-i,t\t^ii'^t\t~iy^ ^ t = r, ...,1. (6) 

Depending on the methodology of computing this joint dis- 
tribution, we can directly derive arbitrary RTS smoothing 
algorithms, including the URTSS [16], the EKS [1], [10], the 
CKS [19], a smoothing extension to the CKF [15], or the GP- 
URTSS, a smoothing extension to the GP-UKF [12]. The in- 
dividual smoothers (URTSS, EKS, CKS, GP-based smoothers 



etc.) simply differ in the way of computing/estimating the 
means and covariances required in Eqs. (4)-(6) [19]. 

To derive the GP-URTSS, we closely follow the derivation 
of the URTSS [16]. The GP-URTSS is a novel smoother, but 
its derivation is relatively straightforward and therefore not 
detailed in this article. Instead, we detail the derivation of 
the GP-RTSS, a robust Rauch-Tung-Striebel smoother for GP 
dynamic systems, which is based on analytic computation of 
the means and (cross-)covariances in Eqs. (4)-(6). 

In GP dynamics systems, the transition function / and 
the measurement function g in Eqs. (l)-(2) are modeled by 
Gaussian processes. This setup is getting more relevant in 
practical applications such as robotics and control, where it 
can be difficult to find an accurate parametric form of / and 
g, respectively [25], [27]. Given the increasing use of GP 
models in robotics and control, the robustness of Bayesian 
state estimation is important. 

II. Gaussian Processes 

In the standard GP regression model, we assume that the 
data P := {X := [xi, . . . ,x„]^, y := [yi, . . . ,y„]^} have 
been generated according to yi = /i(xi) + e^, where h : 
R^ -^ R and e^ ^ Af{0,(j'^) is independent (measurement) 
noise. GPs consider h a random function and infer a posterior 
distribution over h from data. The posterior is used to make 
predictions about function values /i(x*) for arbitrary inputs 

X, e R^. 

Similar to a Gaussian distribution, which is fully specified 
by a mean vector and a covariance matrix, a GP is fully 
specified by a mean function rah{ ■ ) and a covariance function 

fc,,(x,x') :-E^[(/i(x)-m^(x))(/i(x')-TO/,(x'))] (7) 
= covh[h{x),h{x')] e R , x, x' e R^ , (8) 

which specifies the covariance between any two function 
values. Here, E/i denotes the expectation with respect to the 
function h. The covariance function kh{- , • ) is also called a 
kernel. 

Unless stated otherwise, we consider a prior mean function 
TO/i = and use the squared exponential (SE) covariance 
function with automatic relevance determination 



fcsE(Xp,Xq) 



'exp(-i(xp-Xg)^A i(xp-xg)) (9) 



for Xp, Xg e R , plus a noise covariance function fcnoise •= 
5pqa^, such that kh = A:sE+fcnoise- The S denotes the Kronecker 
symbol that is unity when p = q and zero otherwise, resulting 
in i.i.d. measurement noise. In Eq. (9), A = diag([i?^, . . . , £y\) 
is a diagonal matrix of squared characteristic length-scales 
£i, i = 1, . . . , D, and a^ is the signal variance of the latent 
function h. By using the SE covariance function from Eq. (9) 
we assume that the latent function h is smooth and stationary. 
Smoothness and stationarity is easier to justify than fixed 
parametric form of the underlying function. 



A. Expressiveness of the Model 

Although the SE covariance function and the zero-prior 
mean function are common defaults, they retain a great deal 



of expressiveness. Inspired by [20], [28], we demonstrate this 
expressiveness and show the correspondence of our GP model 
to a universal function approximator: Consider a function 



h{x) = J2 



lim — > 7„ exp 



(^-(^ + f))' 



A2 



(10) 



where 7„ -- 7V(0, 1) , n = 1, . . . , AT. Note that in the limit 
h{x) is represented by infinitely many Gaussian-shaped basis 
functions along the real axis with variance A^ and prior 
(Gaussian) random weights 7„, for x e R, and for all i E X. 
The model in Eq. (10) is considered a universal function 
approximator. Writing the sums in Eq. (10) as an integral over 
the real axis R, we obtain 



ri+l 



A2 



ds 



7(s)exp(-^^^)ds-(7*/C)(a;), (11) 



where 7(5) ^ A/'(0, 1) is a white-noise process and /C is a 
Gaussian convolution kernel. The function values of h are 
jointly normal, which follows from the convolution 7 * /C. We 
now analyze the mean function and the covariance function of 
h, which fully specify the distribution of h. The only random 
variables are the weights 7(5). Computing the expected func- 
tion of this model (prior mean function) requires averaging 
over j(s) and yields 



E^[Ma;)]= hix)p{j{s))dj{s) 



(12) 



(11) 



exp 



(x- sf 



7(s)p(7(s)) d'-f{s) ds — 
(13) 



since E-y[7(s)] = 0. Hence, the mean function of h equals zero 
everywhere. Let us now find the covariance function. Since the 
mean function equals zero, for any x,x' eH we obtain 

cov-y[h{x),h{x')] ^ / h{x)h{x')p{'j{s))d'y{s) 



exp 



{x - sf 
A2~ 



exp 



jx' - s)- 
A2~ 



X / -f{sfp{-f{s))d-f{s)ds, 



(14) 



where we used the definition of h in Eq. (11). Using 
var-y[7(s)] = 1 and completing the squares yields 



cov^[h{x), h{x')] 



exp 



2(,-^)V(^ 



A2 



r\2 



2 / {X - X') 

= «exp(^ 2>r 

for suitable a^. 

From Eqs. (13) and (15), we see that the mean function and 
the covariance function of the universal function approximator 
in Eq. (10) correspond to the GP model assumptions we made 
earlier: a prior mean function nih = and the SE covariance 
function in Eq. (9) for a one-dimensional input space. Hence, 



the considered GP prior implicitly assumes latent functions h 
that can be described by the universal function approximator 
in Eq. (11). Examples of covariance functions that encode 
different model assumptions are given in [21]. 

B. Training via Evidence Maximization 

For E target dimensions, we train E GPs assuming that the 
target dimensions are independent at a deterministically given 
test input (if the test input is uncertain, the target dimensions 
covary): After observing a data set V, for each (training) 
target dimension, we learn the D + 1 hyper-parameters of the 
covariance function and the noise variance of the data using 
evidence maximization [20], [21]: Collecting all {D + 2)E 
hyper-parameters in the vector 6, evidence maximization 
yields a point estimate 6 E argmaxg logp(y|X, 9). Evidence 
maximization automatically trades off data fit with function 
complexity and avoids overfitting [21]. 

From here onward, we consider the GP dynamics system 
setup, where two GP models have been trained using evidence 
maximization: GVf, which models the mapping x^^i h^ 
xt, H^ -^ H^, see Eq. (1), and GVg, which models the 
mapping xt i— > zt, R^ -^ R^, see Eq. (2). To keep the 
notation uncluttered, we do not explicitly condition on the 
hyper-parameters 6 and the training data V in the following. 

III. Robust Smoothing in Gaussian Process 
Dynamic Systems 

Analytic moment-based filtering in GP dynamic systems has 
been proposed in [13], where the filter distribution is given by 

p(x,|zi.,)=AA(xt|/x^l„S^|J, (16) 

(17) 

^t\t — -^tlt-l^ ■^t\t-l\^t\t-lj ^t|t-l : (^°) 

for t = 1, . . . ,T. Here, we extend these filtering results to 
analytic moment-based smoothing, where we explicitly take 
nonlinearities into account (no linearization required) while 
propagating full Gaussian densities (no sigma/cubature-point 
representation required) through nonlinear GP models. 

In the following, we detail our novel RTS smoothing 
approach for GP dynamic systems. We fit our smoother in 
the standard frame of Eqs. (4)-(6). For this, we compute the 
means and covariances of the Gaussian approximation 



Af 



Xf-l 
Xt 



/^t-i|t-i 



^t-i|t-i 

\^x 
^t,t-l|t-l 



£-_ 



i,t|t-i 

-yx 
■'t\t-l 



(19) 



to the joint p(xt_i,xt|zi:t_i), after which the smoother is 
fully determined [19]. Our approximation does not involve 
sampling, linearization, or numerical integration. Instead, we 
present closed-form expressions of a deterministic Gaussian 
approximation of the joint distribution in Eq. (19). 

In our case, the mapping Xi_i h^ Xf is not known, but 
instead it is distributed according to GVf, a distribution 
over system functions. For robust filtering and smoothing, we 
therefore need to take the GP (model) uncertainty into account 



by Bayesian averaging according to the GP distribution [13], 
[29]. The marginal p(xt_i|zi:t_i) = 7V(/x^_^|j_p S^_^|j_ J 
is known from filtering [13]. In Sec. III-A, we compute the 
mean and covariance of second marginal p{xt\zi.t-i) and 
then in Sec. III-B the cross-covariance terms SJ 

COv[Xt_l,Xt|zi:t_l]. 



■'t-l,t\t-l 



A. Marginal Distribution 

1) Marginal Mean: Using the system Eq. (1) and integrat- 
ing over all three sources of uncertainties (the system noise, 
the state xt_i, and the system function itself), we apply the 
law of total expectation and obtain the marginal mean 



K\t-i = IEx,_, [E/[/(x,_i)|x,_i]|zi.t-i] 



(20) 



The expectations in Eq. (20) are taken with respect to 
the posterior GP distribution p{f) and the filter distribution 
p(xt_i|zi.(_i) = 7V(/x^_^|j_^,S^_^|j_J at time step t - 1. 
Eq. (20) can be rewritten as fJ'^u_i — Ext_i [m/(xi_i)|zi:i_i] 
with ?TT./(xi_i) := E/[/(xt_i)|xt_i] is the posterior mean 
function of GVf. Writing nif as a finite sum over the SE 
kernels centered at all n training inputs [21], the predicted 
mean for each target dimension a = 1, . . . , _D is 

(/2^|j_l)a = / TO/^(xt_l)p(xt_l|zi:i_l)dXt_l (21) 

= X!j=i'^". / ^/a(xt-i,x,)p(xt_i|zi:i_i)dxt_i, 

where p(xt_i|zi:t_i) = 7V(xt_i | /x^_i|j_i, ^t-i\t-i) i^ the 
filter distribution at time t— 1. Moreover, x^, i = 1, . . . , n, are 
the training set of GVf, kf^ is the covariance function of GVf 
for the ath target dimension (GP hyper-parameters are not 
shared across dimensions), and /3^ :— (Kf^ + '^w '^)~^'ya G 
R". For dimension a, K/^ denotes the kernel matrix (Gram 
matrix), where K/^ — kf^(xi,Xj), i,j — l,...,n. More- 
over, y^ are the training targets, and cr^ is the learned system 
noise variance. The vector /3^ has been pulled out of the 
integration since it is independent of Xf_i. Note that Xj_i 
serves as a test input from the perspective of the GP regression 
model. 

For the SE covariance function in Eq. (9), the integral 
in (21) can be computed analytically (other tractable choices 
are covariance functions containing combinations of squared 
exponentials, trigonometric functions, and polynomials). The 
marginal mean is given as 



(M*|t-i)a - (/3^) ' qS 



(22) 



where we defined 

,2 



X cxp ( - i(x, - /x^_i|,_i)^S^i(x, - /x?^_i|t_i)) , (23) 
S:=S?_i|,_i+A„, (24) 

i — 1, . . . ,n, being the solution to the integral in Eq. (21). 
Here, a'i is the signal variance of the ath target dimension of 
GVf, a learned hyper-parameter of the SE covariance function, 
see Eq. (9). 



2) Marginal Covariance Matrix: We now explicitly com- 
pute the entries of the corresponding covariance H^,^_^. Using 
the law of total covariance, we obtain for a.b — 1, . . . ,D 



B. Cross-Covariance 

By the definition of a covariance and the system Eq. (1), 
the missing cross-covariance matrix S^-^ ^\^_i in Eq. (19) is 



(<») Jb)\ 



(S?|t-l)(ah) =COVx,_i,/,w[a;J ,x\ >i:t_i] 



Ext_i[cOV/,w[/a(xt- 



-WaJbi^t- 



(25) 

Wfc|xt_i]|zi.i_i] 



'^t-lMt-l = Ex,_i,/,wt [Xf_l (/(Xt_l) + Wt) |zi:t_l] 



covx 



_,[E/J/,(x,_i)|x,_i],E/J/fc(x,.i)|x,_i]|zi.,_i] 



^ tJ't-i\t-iif^t\t-i) 



(32) 



where we exploited in the last term that the system noise w 
has mean zero. Note that Eq. (25) is the sum of the covariance 
of (conditional) expected values and the expectation of a (con- 
ditional) covariance. We analyze these terms in the following. 
The covariance of the expectations in Eq. (25) is 



where H^_^,^_^ is the mean of the filter update at time step t— 
1 and fJ'fU_i is the mean of the time update, see Eq. (20). Note 
that we explicitly average out the model uncertainty about /. 
Using the law of total expectations, we obtain 

'^t-i,t\t-i =IEx,_Jxt_iE/,wJ/(xt_i)+Wt|xt_i]^|zi:t_i] 



/ "l/„(Xt-l)TO/,(Xt_i)p(Xt_i)dXt_i - ^j_J„^j_i 



b , 

(26) 



where we used that E/[/(xi_i)|xi_i] ~ mf{xt^i). With 
O: = (K„+<I)-V. andm/Jx,_i) - kfJX,^t-iV (3^, 
we obtain 

COVx,_jTO/Jxt_l),TO/,(xt_l)|zi:t-l] 

= {|3:)'^Q/3t~i^l^^,^M^^^^t^l)b■ (27) 
Following [30], the entries of Q e R"^" are given as 

QtJ - ^/a(x^,M"_lU_l)fc/,(xJ,/X?^_l|,_l)/v^ (28) 



= IEx4_i [X4_im/(xt_i)^|zi:t_i] 

~ ^'■t-l\t-lifj't\t-l) J 



(33) 



(34) 



exp( 



1 „T-r> — 1 v*^ 



2 y 



R^S 



t-llt-l'^ij) 



exp(n?.)/v^ 



nl^log{a}J + \og{ajJ 



UCjA-'Q + CjA,-^C, - zJ.R-iS^_,|,_,z,,-) , 



where we defined R := S^_^|j_^(A-i + A^^) + I, Ct ■= 



^^ - M?-i|t-i' ™d ^ij — ^a ^C 



KXr 



The expected covariance in Eq. (25) is given as 

IExt_i[cOV/[/a(xt_i),/t(xt_i)|xt_i]|zi:t_i] +5ab(yl,^ (29) 

since the noise covariance matrix T,^ is diagonal. Following 
our GP training assumption that different target dimensions 
do not covary if the input is deterministically given, Eq. (29) 
is only non-zero if a = fe, i.e., Eq. (29) plays a role only for 
diagonal entries of H^.^_^. For these diagonal entries (a = 6), 
the expected covariance in Eq. (29) is 

a}^ - tr((K/^ + <I)-^Q) + ^2^ . (30) 

Hence, the desired marginal covariance matrix in Eq. (25) is 

Eq. (27) + Eq. (30) if a = 6 



(^t'lt-l)ab 



Eq. (27) otherwise 



(31) 



We have now solved for the marginal distribution 
p(xt|zi:t_i) in Eq. (19). Since the approximate Gaussian filter 
distribution p(xt_i|zi.t_i) = 7V(/x^_^|j_^, S^_^|j_ J is also 
known, it remains to compute the cross-covariance S^ j^ t\t-i 
to fully determine the Gaussian approximation in Eq. (19). 



where we used the fact that Ey^wt [./ (xt-i) + wt|xi_i] — 
r7iy(xt_i) is the mean function of GVf, which models the 
mapping Xf_i h^ xt, evaluated at xt_i. We thus obtain 

^t^-l.tlt-l = / Xt_lTO/(xt_l)^p(xt_l|zi:i_l)dXt_l (35) 

^ A*t^-i|t-i(A'?|t-i) • 

Writing r7iy(xi_i) as a finite sum over kernels [21] and 
moving the integration into this sum, the integration in Eq. (35) 
turns into 

/ Xt_lTO/^(xt_l)p(xt_l|zi:i_l)dXf_l 
n „ 

^X!^a. / Xt_lfc/„(xt_i,Xi)p(xt_i|zi.i_i)dXt_i 
i=l •' 

for each state dimension a = 1, . . . ,D. With the SE covariance 
function ksE defined in Eq. (9), we compute the integral 
analytically and obtain 

/ Xi_im/^(xt_i)p(xt_i|zi.t_i)dXi_i (36) 

= Z1^«. / ^t-iC3A/'(x«,Aa)AA(/x^_i|,_i,S?^_i|,_i)dxt_i, 

D 

where we defined Cg = (a? (27r)2 -^lAal)^^, such that 
fc/^(xi_i,Xi) — C3A/'(xt_i |xi, Aa). In the definition of C3, 
ai is a hyper-parameter of GV f responsible for the variance 
of the latent function in dimension a. Using the definition of 
S in Eq. (24), the product of the two Gaussians in Eq. (36) 
results in a new (unnormalized) Gaussian c'^^ M {yit-i | ■0^, \t') 
with 

C4i = (27r)-f|A„ + SJ=_i|,_ir^ 

X cxp ( - i(x, - /x^_i|,_ J^S-i(x, - /x^_i|t_i)) , 



Pulling all constants outside the integral in Eq. (36), the 
integral determines the expected value of the product of the 



two Gaussians, ■0,. For a = I, . . . ,D, we obtain 



y2 c3C4'/3;j^V'«- 

Z ^2—1 ^ 



Using C3C4 ^ — Qa , see Eq. (23), and some matrix identities, 
we finally obtain 

(37) 

for covxj_i,/.wt[xt-ii2;t^|zi:i_i], and the joint covariance 
matrix of p(xt_i,Xi|zi:t_i) and, hence, the full Gaussian 
approximation in Eq. (19) is completely determined. 

With the mean and the covariance of the joint distribution 
p(xt_i,xt|zi:i_i) given by Eqs. (22), (31), (37), and the filter 
step, all necessary components are provided to compute the 
smoothing distribution p(xt|zi.y) analytically [19]. 

IV. Simulations 

In the following, we present results analyzing the robust- 
ness of state-of-the art nonlinear filters (Sec. IV-A) and the 
performances of the corresponding smoothers (Sec. IV-B). 

A. Filter Robustness 

We consider the nonlinear stochastic dynamic system 

''"'-^ ' - wt-^Af{0,<^l = 0.2'), (38) 



Xt 



xt-l 
2 



l+x'i 



Wt ■ 



zt ^ 5 sm{xt) + vt , vt^Af{0,al^0.2^). 



(39) 



which is a modified version of the model used in [18], [31]. 
The system is modified in two ways: First, Eq. (38) does not 
contain a purely time-dependent term in the system, which 
would not allow for learning stationary transition dynamics. 
Second, we substituted a sinusoidal measurement function for 
the originally quadratic measurement function used by [18] 
and [31]. The sinusoidal measurement function increases the 
difficulty in computing the marginal distribution p{zt\zi:t-i) 
if the time update distribution p(xt|zi:t_i) is fairly uncertain: 
While the quadratic measurement function can only lead to 
bimodal distributions (assuming a Gaussian input distribution), 
the sinusoidal measurement function in Eq. (39) can lead to 
an arbitrary number of modes — for a broad input distribution. 

The prior variance was set to CTq = 0.5^, i.e., the initial 
uncertainty was fairly high. The system and measurement 
noises (see Eqs. (38)-(39)) were relatively small considering 
the amplitudes of the system function and the measurement 
function. For the numerical analysis, a linear grid in the 
interval [—3,3] of mean values (^g)i, i — 1,...,100, was 
defined. Then, a single latent (initial) state xj^ was sampled 
from p(4'') = AA((Aig)„ all z = 1, . . . , 100. 

For the dynamic system in Eqs. (38)-(39), we analyzed the 
robustness in a single filter step of the EKF, the UKF, the CKF, 
an SIR PF (sequential importance resampling particle filter) 
with 200 particles, the GP-UKF, and the GP-ADF against the 
ground truth, closely approximated by the Gibbs-filter [19]. 
Compared to the evaluation of longer trajectories, evaluating 
a single filter step makes it easier to analyze the robustness of 
individual filtering algorithms. 



Tab. I summarizes the expected performances (RMSE: root- 
mean-square error, MAE: mean-absolute error, NLL: negative 
log-likelihood) of the EKF, the UKF, the CKF, the GP-UKF, 
the GP-ADF, the Gibbs-filter, and the SIR PF for estimating 
the latent state x. The results in the table are based on averages 
over 1,000 test runs and 100 randomly sampled start states 
per test run (see experimental setup). The table also reports 
the 95% standard error of the expected performances. The • 
indicates a method developed in this paper. Tab. I indicates 
that the GP-ADF is the most robust filter and statistically 
significantly outperforms all filters but the sampling-based 
Gibbs-filter and the SIR PF. The green color highlights a 
near-optimal Gaussian filter (Gibbs-filter) and the near-optimal 
particle filter Amongst all other filters the GP-ADF is the 
closest Gaussian filter to the computationally expensive Gibbs- 
filter [19]. Note that the SIR PF is not a Gaussian filter and 
is able to express multi-modality in distributions. Therefore, 
its performance is typically better than the one of Gaussian 
filters. The difference between the SIR PF and a near-optimal 
Gaussian filter, the Gibbs-filter, is expressed in Tab. I. The 
performance difference essentially depicts how much we lose 
by using a Gaussian filter instead of a particle filter. The NLL 
values for the SIR PF are obtained by moment-matching the 
particles. 

The poor performance of the EKF is due to linearization 
errors. The filters based on small sample approximations of 
densities (UKF, GP-UKF, CKF) suffer from the degeneracy 
of these approximations, which is illustrated in Fig. 1. Note 
that the CKF uses a smaller set of cubature points than the 
UKF to determine predictive distributions, which makes the 
CKF statistically even less robust than the UKF. 

B. Smoother Robustness 

We consider a pendulum tracking example taken from [13]. 
We evaluate the performances of four filters and smoothers, 
the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the 
CKF/CKS, the Gibbs-filter/smoother, and the GP-ADF/GP- 
RTSS. The pendulum has mass to = 1 kg and length I — Im. 
The state x = [(p,(p]^ of the pendulum is given by the 
angle tp (measured anti-clockwise from hanging down) and 
the angular velocity ip. The pendulum can exert a constrained 
torque u <E [—5, 5] Nm. We assumed a frictionless system such 
that the transition function / is 



./(xt,Mt) 



t+At 



x{r) — 0.5 mlg sin <p{t) 
0.25mP+I 



dr, 



(40) 



where / is the moment of inertia and g the acceleration of 
gravity. Then, the successor state 



Xt+l =Xt+At = fi^t^Ut 



-Wt 



(41) 



was computed using an ODE solver for Eq. (40) with a zero- 
order hold control signal u{t). In Eq. (41), we set "S^ = 
diag([0.5^, 0.1^]). In our experiment, the torque was sampled 
randomly according to u ^ Z^[— 5,5]Nm and implemented 
using a zero-order-hold controller. Every time increment At — 
0.2 s, the state was measured according to 



Zt — arctan ( 



-i-(f 



tt(yt) 

0.5 — / cos(<pt) 



) +vt, ol 



0.05^ 



(42) 



TABLE I 

Average filter performances (RMSE, MAE, NLL) with standard errors (95% confidence interval) and p-values testing the 

HYPOTHESIS THAT THE OTHER FILTERS ARE BETTER THAN THE GP-ADF USING A ONE-SIDED T-TEST. 



RMSE^ (p-value) 



MAEo; (p-value) 



NLLa; (p-value) 



3.62 ± 0.212 (p = 4.1 X 10^^) 

10.5 ± 1.08 (p< 10-"') 

9.24 ±1.13 (p = 2.8 X 10""') 



2.36 ±0.176 (p = 0.38) 3.05 x 10^ ± 3.02 x 10^ 



< < 10-*) 



EKF [10] 
UKF [11] 
CKF [15] 



8.58 ±0.915 (p< 10~*) 
7.31 ±0.941 (p = 4.2 X lO"*) 



25.6 ±3.39 (p < 10-*) 
2.22 X 10^ ± 17.5 (p < 10-*) 



5.36 ± 0.461 (p = 7.9 x IQ-*) 
2.85 ±0.174 



3.84 ±0.352 (p = 3.3 X 10-^) 
2.17±0.151 



GP-UKF [12] 
GP-ADF [13] 



6.02 ±0.497 (p < 10- 
1.97 ±6.55 X 10- 



Gibbs-filter [19] 
SIRPF 



2.82 ± 0.171 (p = 0.54) 2.12 ± 0.148 (p = 0.56) 

1.57 ± 7.66 X 10-2 (p ^ ;^ Q) Q 3g j. 3 28 X 10-^ (p = 1.0) 



1.96 ±6.62 X 10- 
1.03 ±7.30 X 10- 



(p = 



0.55) 
= 1.0) 





(a) UKF time update p(i'i|0), which misses out substantial prob- 
ability mass of the true predictive distribution. 



-25 -20 -15-10 -5 

(b) UKF determines p(zi|0), which is too sensitive and cannot 
explain the actual measurement zi (black dot, left sub-figure). 



Fig. 1. Degeneracy of the unscented transformation (UT) underlying the UKF. Input distributions to the UT are the Gaussians in the sub-figures at the 
bottom in each panel. The functions the UT is applied to are shown in the top right sub-figures, i.e. the transition mapping, Eq. (38), in Panel (a), and the 
measurement mapping, Eq. (39), in Panel (b). Sigma points are marked by red dots. The predictive distributions are shown in the left sub-figures of each 
panel. The true predictive distributions are the shaded areas; the UT predictive distributions are the solid Gaussians. The predictive distribution of the time 
update in Panel (a) equals the input distribution at the bottom of Panel (b). 



Note that the scalar measurement Eq. (42) solely depends 
on the angle. Thus, the full distribution of the latent state x 
had to be reconstructed using the cross-correlation information 
between the angle and the angular velocity. 

Trajectories of length T = 6s = 30 time steps were started 
from a state sampled from the prior p(xo) = J\f{iJLQ,'Eo) 
with ^l„ = [0,0]^ and Eq = diag([0.01^ ( f^)^]). For each 
trajectory, GP models GVf and OVg are learned based on 
randomly generated data using either 250 or 20 data points. 

Tab. II reports the expected values of the NLL-^-measure 
for the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, 
the GP-ADF/GP-RTSS, and the CKF/CKS when tracking 
the pendulum over a horizon of 6 s, averaged over 1,000 
runs. As in the example in Sec. IV-A, the NLL^j-measure 
emphasizes the robustness of our proposed method: The 
GP-RTSS is the only method that consistently reduced the 
negative log-likelihood value compared to the corresponding 
filtering algorithm. Increasing the NLL^j-values (red color in 
Tab. II) occurs when the filter distribution cannot explain 
the latent state/measurement, an example of which is given 
in Fig. 1(b). Even with only 20 training points, the GP- 
ADF/GP-RTSS outperform the commonly used EKF/EKS, 
UKF/URTSS, CKF/CKS. 

We experimented with even smaller signal-to-noise ratios. 
The GP-RTSS remains robust, while the other smoothers 
remain unstable. 

V. Discussion and Conclusion 

In this paper, we presented GP-RTSS, an analytic Rauch- 
Tung-Striebel smoother for GP dynamic systems, where the 



GPs with SE covariance functions are practical implementa- 
tions of universal function approximators. We showed that the 
GP-RTSS is more robust to nonlinearities than state-of-the- 
art smoothers. There are two main reasons for this: First, 
the GP-RTSS relies neither on linearization (EKS) nor on 
density approximations (URTSS/CKS) to compute an optimal 
Gaussian approximation of the predictive distribution when 
mapping a Gaussian distribution through a nonlinear function. 
This property avoids incoherent estimates of the filtering and 
smoothing distributions as discussed in Sec IV-A. Second, GPs 
allow for more robust "system identification" than standard 
methods since they coherently represent uncertainties about 
the system and measurement functions at locations that have 
not been encountered in the data collection phase. The GP- 
RTSS is a robust smoother since it accounts for model 
uncertainties in a principled Bayesian way. 

After training the GPs, which can be performed offline, the 
computational complexity of the GP-RTSS (including filtering) 
is 0{T{E'^ + n'^{D^ + E^))) for a time series of length T. 
Here, n is the size of the GP training sets, and D and E are 
the dimensions of the state and the measurements, respectively. 
The computational complexity is due to the inversion of the D 
and i?-dimensional covariance matrices, and the computation 
of the matrix Q e R"^" in Eq. (28), required for each entry of 
a D and i?-dimensional covariance matrix. The computational 
complexity scales linearly with the number of time steps. The 
computational demand of classical Gaussian smoothers, such 
as the URTSS and the EKS is 0{T{D^ + E^)). Although 
not reported here, we verified the computational complexity 
experimentally. Approximating the online computations of the 



TABLE II 

Expected filtering and smoothing performances (pendulum tracking) with 95% confidence intervals. 



Filters 


EKF [10] 


UKF [11] 


CKF [15] 


GP-UKF250 [12] 


GP-ADF250 [13] 


GP-ADF20 [13] 


E[NLL,] 


1.6 X 10''±29.1 


6.0 ±3.02 


28.5 ±9.83 


4.4 ±1.32 


1.44 ±0.117 


6.63 ±0.149 


Smoothers 


EKS [10] 


URTSS [16] 


CKS [19] 


GP-URTSS^r^o 


GP-RTSS|,j(, 


GP-RTSS*,, 


E[NLL,] 


3.3 X 10''±60.5 


17.2 ±10.0 


72.0 ±25.1 


10.3 ±3.85 


1.04 ±0.204 


6.57 ±0.148 



GP-RTSS by numerical integration or grids scales poorly with 
increasing dimension. These problems already appear in the 
histogram filter [3]. By explicitly providing equations for the 
solution of the involved integrals, we show that numerical 
integration is not necessary and the GP-RTSS is a practical 
approach to filtering in GP dynamic systems. 

Although the GP-RTSS is computationally more involved 
than the URTSS, the EKS, and the CKS, this does not 
necessarily imply that smoothing with the GP-RTSS is slower: 
function evaluations, which are heavily used by the EKS/CKS/ 
URTSS are not necessary in the GP-RTSS (after training). 
In the pendulum example, repeatedly calling the ODE solver 
caused the EKS/CKS/URTSS to be slower than the GP-RTSS 
(with 250 training points) by a factor of two. 

The increasing use of GPs for model learning in robotics and 
control will eventually require principled smoothing methods 
for GP models. To our best knowledge, the proposed GP-RTSS 
is the most principled GP-smoother since all computations can 
be performed analytically exactly, i.e., without function lin- 
earization or sigma/cubature point representation of densities, 
while exactly integrating out the model uncertainty induced 
by the GP distribution. 

Code will be made publicly available at littp : //mloss . 
org. 
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